Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the various fees they carry, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others apply only under specific circumstances.
Customers leaving the credit card service translates directly into lost revenue, so the bank wants to analyze its customer data, identify which customers are likely to leave the service and why, and improve on those areas.
As a data scientist at Thera Bank, you need to build a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
This is a commented Jupyter (IPython) notebook in which all the instructions and tasks to be performed are mentioned.
# To help with reading and manipulating data
import pandas as pd
import numpy as np
# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# To be used for missing value imputation
from sklearn.impute import SimpleImputer
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
ConfusionMatrixDisplay,
)
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# Random undersampler for under sampling the data
from imblearn.under_sampling import RandomUnderSampler
# Use MICE since income data and marital status are likely to have been withheld on purpose
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Gradient boosting models (GradientBoostingClassifier is already imported above)
from xgboost import XGBClassifier
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To suppress scientific notation for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# This will help in making the Python code more structured automatically (good coding practice)
# %load_ext nb_black
df_raw = pd.read_csv('BankChurners.csv')
df = df_raw.copy()
df.head()
| | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
df.tail()
| | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10122 | 772366833 | Existing Customer | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | 3 | 2 | 3 | 4003.000 | 1851 | 2152.000 | 0.703 | 15476 | 117 | 0.857 | 0.462 |
| 10123 | 710638233 | Attrited Customer | 41 | M | 2 | NaN | Divorced | $40K - $60K | Blue | 25 | 4 | 2 | 3 | 4277.000 | 2186 | 2091.000 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | 716506083 | Attrited Customer | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | 5 | 3 | 4 | 5409.000 | 0 | 5409.000 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10125 | 717406983 | Attrited Customer | 30 | M | 2 | Graduate | NaN | $40K - $60K | Blue | 36 | 4 | 3 | 3 | 5281.000 | 0 | 5281.000 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
| 10126 | 714337233 | Attrited Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Silver | 25 | 6 | 2 | 4 | 10388.000 | 1961 | 8427.000 | 0.703 | 10294 | 61 | 0.649 | 0.189 |
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CLIENTNUM | 10127.000 | 739177606.334 | 36903783.450 | 708082083.000 | 713036770.500 | 717926358.000 | 773143533.000 | 828343083.000 |
| Customer_Age | 10127.000 | 46.326 | 8.017 | 26.000 | 41.000 | 46.000 | 52.000 | 73.000 |
| Dependent_count | 10127.000 | 2.346 | 1.299 | 0.000 | 1.000 | 2.000 | 3.000 | 5.000 |
| Months_on_book | 10127.000 | 35.928 | 7.986 | 13.000 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.000 | 3.813 | 1.554 | 1.000 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.000 | 2.341 | 1.011 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.000 | 2.455 | 1.106 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.000 | 8631.954 | 9088.777 | 1438.300 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.000 | 1162.814 | 814.987 | 0.000 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.000 | 7469.140 | 9090.685 | 3.000 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.000 | 0.760 | 0.219 | 0.000 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.000 | 4404.086 | 3397.129 | 510.000 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.000 | 64.859 | 23.473 | 10.000 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.000 | 0.712 | 0.238 | 0.000 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.000 | 0.275 | 0.276 | 0.000 | 0.023 | 0.176 | 0.503 | 0.999 |
df.describe(include=["object"]).T
| | count | unique | top | freq |
|---|---|---|---|---|
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 |
| Gender | 10127 | 2 | F | 5358 |
| Education_Level | 8608 | 6 | Graduate | 3128 |
| Marital_Status | 9378 | 3 | Married | 4687 |
| Income_Category | 10127 | 6 | Less than $40K | 3561 |
| Card_Category | 10127 | 4 | Blue | 9436 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CLIENTNUM                 10127 non-null  int64
 1   Attrition_Flag            10127 non-null  object
 2   Customer_Age              10127 non-null  int64
 3   Gender                    10127 non-null  object
 4   Dependent_count           10127 non-null  int64
 5   Education_Level           8608 non-null   object
 6   Marital_Status            9378 non-null   object
 7   Income_Category           10127 non-null  object
 8   Card_Category             10127 non-null  object
 9   Months_on_book            10127 non-null  int64
 10  Total_Relationship_Count  10127 non-null  int64
 11  Months_Inactive_12_mon    10127 non-null  int64
 12  Contacts_Count_12_mon     10127 non-null  int64
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64
 18  Total_Trans_Ct            10127 non-null  int64
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
df.shape
(10127, 21)
# duplicate values in the data
df.duplicated().sum()
0
# missing values in the data
print(df.isnull().sum() )
print( round(df.isnull().sum() / df.isnull().count() * 100, 2))
CLIENTNUM                      0
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64
CLIENTNUM                   0.000
Attrition_Flag              0.000
Customer_Age                0.000
Gender                      0.000
Dependent_count             0.000
Education_Level            15.000
Marital_Status              7.400
Income_Category             0.000
Card_Category               0.000
Months_on_book              0.000
Total_Relationship_Count    0.000
Months_Inactive_12_mon      0.000
Contacts_Count_12_mon       0.000
Credit_Limit                0.000
Total_Revolving_Bal         0.000
Avg_Open_To_Buy             0.000
Total_Amt_Chng_Q4_Q1        0.000
Total_Trans_Amt             0.000
Total_Trans_Ct              0.000
Total_Ct_Chng_Q4_Q1         0.000
Avg_Utilization_Ratio       0.000
dtype: float64
# Types of entries in each column
df.nunique()
CLIENTNUM                   10127
Attrition_Flag                  2
Customer_Age                   45
Gender                          2
Dependent_count                 6
Education_Level                 6
Marital_Status                  3
Income_Category                 6
Card_Category                   4
Months_on_book                 44
Total_Relationship_Count        6
Months_Inactive_12_mon          7
Contacts_Count_12_mon           7
Credit_Limit                 6205
Total_Revolving_Bal          1974
Avg_Open_To_Buy              6813
Total_Amt_Chng_Q4_Q1         1158
Total_Trans_Amt              5033
Total_Trans_Ct                126
Total_Ct_Chng_Q4_Q1           830
Avg_Utilization_Ratio         964
dtype: int64
# Making a list of all categorical variables
cat_col = [
"Attrition_Flag",
"Gender",
"Education_Level",
"Marital_Status",
"Income_Category",
"Card_Category",
]
# Printing number of count of each unique value in each column
for column in cat_col:
print(df[column].value_counts())
print("-" * 40)
Existing Customer    8500
Attrited Customer    1627
Name: Attrition_Flag, dtype: int64
----------------------------------------
F    5358
M    4769
Name: Gender, dtype: int64
----------------------------------------
Graduate         3128
High School      2013
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451
Name: Education_Level, dtype: int64
----------------------------------------
Married     4687
Single      3943
Divorced     748
Name: Marital_Status, dtype: int64
----------------------------------------
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: Income_Category, dtype: int64
----------------------------------------
Blue        9436
Silver       555
Gold         116
Platinum      20
Name: Card_Category, dtype: int64
----------------------------------------
Categorical variables: Attrition_Flag, Gender, Education_Level, Marital_Status, Income_Category, Card_Category
Questions:
How is the total transaction amount distributed?
Total_Trans_Amt: the distribution is uneven, with four humps, and is right-skewed, with many outliers on the right-hand side. The maximum transaction amount among the attrition cases is below the median transaction amount of the non-attrition cases. The overall maximum is 18,484, and 75% of the values are below 4,741.
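The "attrited max below existing median" comparison can be checked directly with a grouped aggregate. A minimal sketch on made-up data that reuses the dataset's column names:

```python
import pandas as pd

# Toy stand-in for the churn dataframe; all values are illustrative only
toy = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer"] * 4 + ["Attrited Customer"] * 4,
    "Total_Trans_Amt": [4000, 5200, 3900, 4700, 1800, 2100, 2500, 1900],
})

# Per-group median and max make the comparison explicit
summary = toy.groupby("Attrition_Flag")["Total_Trans_Amt"].agg(["median", "max"])
attrited_max = summary.loc["Attrited Customer", "max"]
existing_median = summary.loc["Existing Customer", "median"]
print(summary)
print(attrited_max < existing_median)
```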
What is the distribution of the level of education of customers?
Education_Level: about 40% of customers are graduates or above (roughly 31% graduates alone), and about 15% are uneducated. The attrition percentage is slightly higher for post-graduates and doctorates; for the remaining levels it looks about the same.
What is the distribution of the level of income of customers?
Income_Category: the largest share of customers (about 35%) earns less than $40K. The attrition percentage is lowest in the $60K - $80K category and highest in the $120K+ category.
How does the change in transaction count between Q4 and Q1 (Total_Ct_Chng_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
Total_Ct_Chng_Q4_Q1: the median is higher for the non-attrition cases than for the attrition cases, as is the minimum. The distribution looks roughly symmetric, with the mean and median close together, though there are outliers on both sides.
How does the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
The attrition percentage is very high (about 50%) for the small group of customers who were inactive for 0 months. Among the rest, attrition is lowest for customers inactive for 1 month and generally rises with longer inactivity, peaking at 4 months.
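A row-normalised crosstab is a quick way to read attrition rate per inactivity bucket. A small sketch on made-up data (column names match the dataset, values do not):

```python
import pandas as pd

# Illustrative data only
toy = pd.DataFrame({
    "Months_Inactive_12_mon": [0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Attrition_Flag":         [1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0],
})

# normalize="index" turns counts into per-row shares, i.e. the attrition
# rate for each number of inactive months
rates = pd.crosstab(
    toy["Months_Inactive_12_mon"], toy["Attrition_Flag"], normalize="index"
)
print(rates[1])
```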
What are the attributes that have a strong correlation with each other?
100% correlation: Avg_Open_To_Buy and Credit_Limit
81% correlation: Total_Trans_Amt and Total_Trans_Ct
79% correlation: Months_on_book and Customer_Age
62% correlation: Avg_Utilization_Ratio and Total_Revolving_Bal
Also a 54% negative correlation between Avg_Utilization_Ratio and Avg_Open_To_Buy
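The strongly correlated pairs listed above can also be pulled out of the correlation matrix programmatically. A minimal sketch on a toy frame (the 0.6 threshold is illustrative, chosen to mirror the 62%+ pairs noted here):

```python
import numpy as np
import pandas as pd

# Toy numeric frame with one perfectly correlated pair (b = 2 * a)
df_num = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 1.0, 3.0, 2.0],
})

corr = df_num.corr().abs()
# Keep only the upper triangle so each pair is reported once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().sort_values(ascending=False)
strong = pairs[pairs > 0.6]
print(strong)
```

On the real dataframe, `df.select_dtypes(include=np.number)` would replace the toy frame.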
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=True, bins=None, only_box=False):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default True)
    bins: number of bins for histogram (default None)
    only_box: whether to draw only the boxplot (default False)
    """
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
    if not only_box:
        sns.histplot(
            data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
        )  # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
# count = data[feature].nunique()
# if n is None:
# plt.figure(figsize=(count + 1, 5))
# else:
# plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-coordinate: horizontal center of the bar
        y = p.get_height()  # y-coordinate: top of the bar
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
plt.show()
### Function to plot distributions
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of " + predictor + " for " + target + "=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
)
    axs[0, 1].set_title("Distribution of " + predictor + " for " + target + "=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
df.head()
| | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
# Basic data pre-processing for analysis
df = df.drop(["CLIENTNUM"], axis=1)
# Encode the target: Attrited Customer -> 1, Existing Customer -> 0
df["Attrition_Flag"] = df["Attrition_Flag"].map(
    {"Existing Customer": 0, "Attrited Customer": 1}
)
## Univariate analysis
numerical_columns = df.select_dtypes(include='number').columns.tolist()
categorical_columns = df.select_dtypes(include='object').columns.tolist()
# Plot a combined histogram and boxplot for each numerical column
# (histogram_boxplot creates its own figure, so no plt.figure() is needed here)
for column in numerical_columns:
    histogram_boxplot(df, column)
    plt.suptitle(f'Distribution of {column}')
for column in numerical_columns:
    plt.figure()  # Create a new figure for each barplot
    plt.title(f'Barplot for {column}')
    labeled_barplot(df, column, perc=True, n=10)
# Repeat for categorical columns
for column in categorical_columns:
    plt.figure()  # Create a new figure for each barplot
    plt.title(f'Barplot for {column}')
    labeled_barplot(df, column, perc=True, n=10)
#Bivariate
cols_list = df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(12, 7))
sns.heatmap(
    df[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
# selecting numerical columns
num_col = df.select_dtypes(include=np.number).columns.tolist()
sns.pairplot(data=df[num_col], diag_kind="kde")
plt.show()
distribution_plot_wrt_target(df, "Total_Revolving_Bal", "Attrition_Flag")
distribution_plot_wrt_target(df, "Credit_Limit", "Attrition_Flag")
distribution_plot_wrt_target(df, "Customer_Age", "Attrition_Flag")
distribution_plot_wrt_target(df, "Total_Trans_Ct", "Attrition_Flag")
distribution_plot_wrt_target(df, "Total_Trans_Amt", "Attrition_Flag")
distribution_plot_wrt_target(df, "Total_Ct_Chng_Q4_Q1", "Attrition_Flag")
distribution_plot_wrt_target(df, "Avg_Utilization_Ratio", "Attrition_Flag")
distribution_plot_wrt_target(df, "Months_on_book", "Attrition_Flag")
distribution_plot_wrt_target(df, "Avg_Open_To_Buy", "Attrition_Flag")
stacked_barplot(df, "Gender", "Attrition_Flag")
Attrition_Flag     0     1    All
Gender
All             8500  1627  10127
F               4428   930   5358
M               4072   697   4769
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Marital_Status", "Attrition_Flag")
Attrition_Flag     0     1   All
Marital_Status
All             7880  1498  9378
Married         3978   709  4687
Single          3275   668  3943
Divorced         627   121   748
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Education_Level", "Attrition_Flag")
Attrition_Flag      0     1   All
Education_Level
All              7237  1371  8608
Graduate         2641   487  3128
High School      1707   306  2013
Uneducated       1250   237  1487
College           859   154  1013
Doctorate         356    95   451
Post-Graduate     424    92   516
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Income_Category", "Attrition_Flag")
Attrition_Flag      0     1    All
Income_Category
All              8500  1627  10127
Less than $40K   2949   612   3561
$40K - $60K      1519   271   1790
$80K - $120K     1293   242   1535
$60K - $80K      1213   189   1402
abc               925   187   1112
$120K +           601   126    727
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Contacts_Count_12_mon", "Attrition_Flag")
Attrition_Flag            0     1    All
Contacts_Count_12_mon
All                    8500  1627  10127
3                      2699   681   3380
2                      2824   403   3227
4                      1077   315   1392
1                      1391   108   1499
5                       117    59    176
6                         0    54     54
0                       392     7    399
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Months_Inactive_12_mon", "Attrition_Flag")
Attrition_Flag             0     1    All
Months_Inactive_12_mon
All                     8500  1627  10127
3                       3020   826   3846
2                       2777   505   3282
4                        305   130    435
1                       2133   100   2233
5                        146    32    178
6                        105    19    124
0                         14    15     29
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Total_Relationship_Count", "Attrition_Flag")
Attrition_Flag               0     1    All
Total_Relationship_Count
All                       8500  1627  10127
3                         1905   400   2305
2                          897   346   1243
1                          677   233    910
5                         1664   227   1891
4                         1687   225   1912
6                         1670   196   1866
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Dependent_count", "Attrition_Flag")
Attrition_Flag      0     1    All
Dependent_count
All              8500  1627  10127
3                2250   482   2732
2                2238   417   2655
1                1569   269   1838
4                1314   260   1574
0                 769   135    904
5                 360    64    424
------------------------------------------------------------------------------------------------------------------------
# Convert object-type columns to the category datatype before imputation
# (more memory-efficient, and get_dummies will emit the same columns for every split)
for col in cat_col:
df[col] = df[col].astype('category')
# Avg_Open_To_Buy is 100% correlated with Credit_Limit (see the heatmap), so it is
# a candidate to drop; it is kept in X for now (CLIENTNUM was already dropped above)
X = df.drop(["Attrition_Flag"], axis=1)
y = df["Attrition_Flag"]
# Replace the 'abc' placeholder in Income_Category with NaN so it can be imputed
# (done before the train/validation/test split)
X["Income_Category"] = X["Income_Category"].replace("abc", np.nan)
# outlier detection using boxplot
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
plt.subplot(4, 4, i + 1)
plt.boxplot(df[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
# Check for outliers (must be done before the data split)
# If they cannot be dropped, a transformation such as log or square root can be applied
num_df = df.select_dtypes(include=["float64", "int64"])
Q1 = num_df.quantile(0.25)
Q3 = num_df.quantile(0.75)
IQR = Q3 - Q1  # interquartile range (75% - 25%)
# Lower and upper bounds; all values outside these bounds are treated as outliers
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
# checking the % of outliers per column
print(((num_df < lower) | (num_df > upper)).sum() / len(df) * 100)
# Future improvement: apply log transformations to right-skewed data and sqrt transformations to left-skewed data
# (to be done separately on the train, validation and test sets, so deferred until after the split)
Customer_Age                0.020
Dependent_count             0.000
Months_on_book              3.812
Total_Relationship_Count    0.000
Months_Inactive_12_mon      3.268
Contacts_Count_12_mon       6.211
Credit_Limit                9.717
Total_Revolving_Bal         0.000
Avg_Open_To_Buy             9.509
Total_Amt_Chng_Q4_Q1        3.910
Total_Trans_Amt             8.848
Total_Trans_Ct              0.020
Total_Ct_Chng_Q4_Q1         3.891
Avg_Utilization_Ratio       0.000
dtype: float64
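For the planned log transformation, np.log1p is a safe choice because several columns (e.g. Total_Revolving_Bal) contain zeros, which plain np.log cannot handle. A small sketch on a made-up right-skewed series:

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values standing in for e.g. Credit_Limit / Total_Trans_Amt
s = pd.Series([500.0, 800.0, 1200.0, 2000.0, 3500.0, 9000.0, 34000.0])

before = s.skew()
s_log = np.log1p(s)  # log1p(x) = log(1 + x), safe for zero values
after = s_log.skew()
print(before, after)  # the transform reduces the right skew
```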
X.head()
| | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
#Split into train/validation/test before imputing
# first we split data into 2 parts, say temporary and test
# stratify on y since the target classes are imbalanced
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 19) (2026, 19) (2026, 19)
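Stratified splitting preserves the class ratio in every split, which matters here with roughly 16% attrition. A quick illustration on a synthetic imbalanced target (the names are made up for the demo):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 20% positives, mirroring the skewed churn classes
X_demo = pd.DataFrame({"f": range(100)})
y_demo = pd.Series([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)
# stratify=y keeps the 20/80 class balance intact in both parts
print(y_tr.mean(), y_te.mean())
```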
X_train.head()
| | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 800 | 40 | M | 2 | NaN | Single | $120K + | Blue | 21 | 6 | 4 | 3 | 20056.000 | 1602 | 18454.000 | 0.466 | 1687 | 46 | 0.533 | 0.080 |
| 498 | 44 | M | 1 | NaN | Married | NaN | Blue | 34 | 6 | 2 | 0 | 2885.000 | 1895 | 990.000 | 0.387 | 1366 | 31 | 0.632 | 0.657 |
| 4356 | 48 | M | 4 | High School | Married | $80K - $120K | Blue | 36 | 5 | 1 | 2 | 6798.000 | 2517 | 4281.000 | 0.873 | 4327 | 79 | 0.881 | 0.370 |
| 407 | 41 | M | 2 | Graduate | NaN | $60K - $80K | Silver | 36 | 6 | 2 | 0 | 27000.000 | 0 | 27000.000 | 0.610 | 1209 | 39 | 0.300 | 0.000 |
| 8728 | 46 | M | 4 | High School | Divorced | $40K - $60K | Silver | 36 | 2 | 2 | 3 | 15034.000 | 1356 | 13678.000 | 0.754 | 7737 | 84 | 0.750 | 0.090 |
# Dummy-encode the categorical variables (done after the split, separately per set)
X_train = pd.get_dummies(data=X_train, drop_first=True)
X_val = pd.get_dummies(data=X_val, drop_first=True)
X_test = pd.get_dummies(data=X_test, drop_first=True)
# Align validation/test columns to the training columns, in case a rare
# category is absent from one of the splits
X_val = X_val.reindex(columns=X_train.columns, fill_value=0)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
column_names = X_train.columns.tolist()
# Use column names as feature names
feature_names = column_names
X_train.head()
| | Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Gender_M | Education_Level_Doctorate | Education_Level_Graduate | Education_Level_High School | Education_Level_Post-Graduate | Education_Level_Uneducated | Marital_Status_Married | Marital_Status_Single | Income_Category_$40K - $60K | Income_Category_$60K - $80K | Income_Category_$80K - $120K | Income_Category_Less than $40K | Card_Category_Gold | Card_Category_Platinum | Card_Category_Silver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 800 | 40 | 2 | 21 | 6 | 4 | 3 | 20056.000 | 1602 | 18454.000 | 0.466 | 1687 | 46 | 0.533 | 0.080 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 498 | 44 | 1 | 34 | 6 | 2 | 0 | 2885.000 | 1895 | 990.000 | 0.387 | 1366 | 31 | 0.632 | 0.657 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4356 | 48 | 4 | 36 | 5 | 1 | 2 | 6798.000 | 2517 | 4281.000 | 0.873 | 4327 | 79 | 0.881 | 0.370 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 407 | 41 | 2 | 36 | 6 | 2 | 0 | 27000.000 | 0 | 27000.000 | 0.610 | 1209 | 39 | 0.300 | 0.000 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 8728 | 46 | 4 | 36 | 2 | 2 | 3 | 15034.000 | 1356 | 13678.000 | 0.754 | 7737 | 84 | 0.750 | 0.090 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
# Impute separately for train/validation/test (fit on train only) to avoid data leakage.
# The missing values sit in Education_Level, Marital_Status and Income_Category;
# dummies were created above instead of replacing these category strings with numbers here.
#impute_cols = ["Education_Level", "Marital_Status", "Income_Category"]
### IMPUTE TRAINING DATA ###
# IterativeImputer is experimental and must be enabled before import
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
# Instantiate the IterativeImputer
training_imputer = IterativeImputer(initial_strategy="most_frequent", max_iter=100, random_state=1)
# Fit the imputer on the training data only
training_imputer.fit(X_train)
# Transform the training data to impute missing values
# (transform returns an ndarray, so wrap it back into a DataFrame to keep column names)
X_train = pd.DataFrame(training_imputer.transform(X_train), columns=column_names)
### IMPUTE VALIDATION DATA ###
# Transform the validation data using the imputer fitted on the training data
X_val = pd.DataFrame(training_imputer.transform(X_val), columns=column_names)
### IMPUTE TEST DATA ###
# Transform the test data using the imputer fitted on the training data
X_test = pd.DataFrame(training_imputer.transform(X_test), columns=column_names)
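The leakage-free pattern used above (fit on the training split only, then reuse the fitted imputer on validation/test) can be illustrated on toy data; `SimpleImputer` stands in for the iterative imputer here to keep the sketch minimal:

```python
# Toy sketch of leakage-free imputation: statistics come from train only,
# then the same fitted imputer transforms the held-out split.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"x": [1.0, 1.0, np.nan, 3.0]})
test = pd.DataFrame({"x": [np.nan, 5.0]})

imp = SimpleImputer(strategy="most_frequent")
imp.fit(train)                     # mode computed from train only
test_filled = imp.transform(test)  # test NaN filled with train's mode (1.0)
print(test_filled[0, 0])           # 1.0
```

Fitting the imputer on validation or test data would leak information from those splits into preprocessing and bias the evaluation.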
# Outliers - Future Improvement - Ignore for now
# Function to perform capping based on IQR and apply log transformation
"""
def cap_and_log(column, caplog, applylog):
    if caplog == 1:
        # Calculate quartiles
        Q1 = column.quantile(0.25)
        Q3 = column.quantile(0.75)
        # Calculate IQR
        IQR = Q3 - Q1
        # Calculate upper and lower bounds
        upper_bound = Q3 + 2 * IQR
        lower_bound = Q1 - 2 * IQR
        # Identify values above or below the bounds
        exceed_values = column[column > upper_bound]
        lower_values = column[column < lower_bound]
        # Cap values above or below the bounds
        column.loc[column > upper_bound] = upper_bound
        column.loc[column < lower_bound] = lower_bound
    if applylog == 1:
        # Apply log to all values
        column = np.log(column)
        # Apply log transformation to capped values
        #column.loc[exceed_values.index] = np.log(exceed_values)
        #column.loc[lower_values.index] = np.log(lower_values)
    return column
# Replace outliers beyond 2*IQR for Total_Trans_Ct, Months_on_book
# Replace outliers beyond 2*IQR for Months_Inactive_12_mon, Contacts_Count_12_mon
# Apply log transformation to right-skewed data / outliers > 2*IQR for Credit_Limit, Total_Trans_Ct, Total_Trans_Amt, Total_Ct_Chng_Q4_Q1, Total_Amt_Chng_Q4_Q1
# Apply Capping and NO LOG
X_train["Total_Trans_Ct"] = cap_and_log(X_train["Total_Trans_Ct"], 1, 0)
X_train["Months_on_book"] = cap_and_log(X_train["Months_on_book"], 1, 0)
# Apply Capping and NO LOG
X_train["Months_Inactive_12_mon"] = (cap_and_log(X_train["Months_Inactive_12_mon"], 1, 0))
X_train["Contacts_Count_12_mon"] = (cap_and_log(X_train["Contacts_Count_12_mon"], 1, 0))
# Apply cap_and_log function to each column
# Do separately the log ( XTest and XVal)
# Apply NO capping and log to XTrain
X_train["Credit_Limit"] = (cap_and_log(X_train["Credit_Limit"], 0, 1))
X_train["Total_Trans_Ct"] = (cap_and_log(X_train["Total_Trans_Ct"], 0, 1))
X_train["Total_Trans_Amt"] = (cap_and_log(X_train["Total_Trans_Amt"], 0, 1))
X_train["Total_Ct_Chng_Q4_Q1"] = (cap_and_log(X_train["Total_Ct_Chng_Q4_Q1"], 0, 1))
X_train["Total_Amt_Chng_Q4_Q1"] = (cap_and_log(X_train["Total_Amt_Chng_Q4_Q1"], 0, 1))
# Apply NO capping , only log to Xval and Xtest (so as not to change the val and test data)
X_val["Credit_Limit"] = (cap_and_log(X_val["Credit_Limit"],0, 1))
X_val["Total_Trans_Ct"] = (cap_and_log(X_val["Total_Trans_Ct"], 0, 1))
X_val["Total_Trans_Amt"] = (cap_and_log(X_val["Total_Trans_Amt"], 0, 1))
X_val["Total_Ct_Chng_Q4_Q1"] = (cap_and_log(X_val["Total_Ct_Chng_Q4_Q1"], 0, 1))
X_val["Total_Amt_Chng_Q4_Q1"] = (cap_and_log(X_val["Total_Amt_Chng_Q4_Q1"],0, 1))
X_test["Credit_Limit"] = (cap_and_log(X_test["Credit_Limit"],0, 1))
X_test["Total_Trans_Ct"] = (cap_and_log(X_test["Total_Trans_Ct"], 0, 1))
X_test["Total_Trans_Amt"] = (cap_and_log(X_test["Total_Trans_Amt"], 0, 1))
X_test["Total_Ct_Chng_Q4_Q1"] = (cap_and_log(X_test["Total_Ct_Chng_Q4_Q1"], 0, 1))
X_test["Total_Amt_Chng_Q4_Q1"] = (cap_and_log(X_test["Total_Amt_Chng_Q4_Q1"],0, 1))
"""
# No outlier treatment done for Age and Income (they also happen to be categorical)
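The commented-out `cap_and_log` helper above can be made runnable as sketched below on toy data. With `caplog=1` it caps values outside Q1 - 2*IQR and Q3 + 2*IQR; with `applylog=1` it applies a natural log (so values must be positive). `Series.clip` replaces the in-place `.loc` assignments and avoids mutating the caller's data:

```python
# Runnable sketch of the IQR-capping / log-transform helper on toy data.
import numpy as np
import pandas as pd

def cap_and_log(column, caplog, applylog):
    col = column.copy()  # avoid mutating the caller's Series
    if caplog == 1:
        q1, q3 = col.quantile(0.25), col.quantile(0.75)
        iqr = q3 - q1
        # cap values outside [q1 - 2*iqr, q3 + 2*iqr]
        col = col.clip(lower=q1 - 2 * iqr, upper=q3 + 2 * iqr)
    if applylog == 1:
        col = np.log(col)  # values must be positive
    return col

s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])   # 100.0 is an outlier
capped = cap_and_log(s, 1, 0)
print(capped.max())  # 8.0 (Q3 + 2*IQR = 4 + 2*2)
```

Returning a copy makes the helper safe to call on slices of a DataFrame without `SettingWithCopyWarning`.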
# Check if anything is wrong
#X_train.head()
# Recheck that the number of outliers has been reduced
# Checking the % of outliers, at least in X_train
# Calculate quartiles
"""
Q1 = X_train.quantile(0.25)
Q3 = X_train.quantile(0.75)
# Calculate IQR
IQR = Q3 - Q1
# Calculate upper and lower bounds
upper_bound = Q3 + 2 * IQR
lower_bound = Q1 - 2 * IQR
print(((X_train.select_dtypes(include=["float64", "int64"]) < lower_bound) | (X_train.select_dtypes(include=["float64", "int64"]) > upper_bound)).sum() / len(X_train) * 100)
"""
# Checked missing values handled
#print(X_train.isna().sum())
#print(X_val.isna().sum())
#print(X_test.isna().sum())
The nature of predictions made by the classification model will translate as follows: a false negative (predicting that a customer will stay when they actually leave) costs the bank a customer it could have tried to retain, while a false positive only costs an unnecessary retention effort.
Which metric to optimize? Since false negatives are the costlier error here, Recall on the attrited class is the primary metric to maximize.
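A toy illustration of this choice (assumed labels, with 1 = attrited customer): recall counts how many actual churners the model catches, while precision counts how many churn alerts are correct.

```python
# Toy example: one churner missed (false negative), one false alarm.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = attrited customer
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4 = 0.75
print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/4 = 0.75
```

Maximizing recall pushes the missed-churner count (FN) toward zero, which is exactly the error the bank wants to avoid.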
Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    roc_auc = roc_auc_score(target, pred)  # to compute ROC-AUC

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
            "ROC-AUC": roc_auc,
        },
        index=[0],
    )

    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
Sample code for model building with original data
from xgboost import XGBClassifier  # not imported in the setup cell
models = []  # Empty list to store all the models
train_scores = [] # Empty list to store all model's recall scores
names = [] # Empty list to store name of the models
val_scores = []
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("XGB", XGBClassifier(random_state=1)))
models.append(("DT", DecisionTreeClassifier(random_state=1)))
print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_train, model.predict(X_train))
    train_scores.append(scores)
    names.append(name)
    print("{}: {}".format(name, scores))
#for name, model in models:
# model_performance_classification_sklearn(model, X_train, y_train)
print("\n" "Validation Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))
    val_scores.append(scores_val)  # fix: was appending the training score
    #confusion_matrix_sklearn(model, X_val, y_val)
#for name, model in models:
# model_performance_classification_sklearn(model, X_val, y_val)
Training Performance:

Bagging: 0.985655737704918
Random forest: 1.0
Adaboost: 0.8381147540983607
GBM: 0.8780737704918032
XGB: 1.0
DT: 1.0

Validation Performance:

Bagging: 0.8128834355828221
Random forest: 0.7944785276073619
Adaboost: 0.8588957055214724
GBM: 0.8588957055214724
XGB: 0.8650306748466258
DT: 0.8098159509202454
for name, model in models:
    print("Model Name:", name)
    print(model_performance_classification_sklearn(model, X_train, y_train))
    print("")
    print("")
Model Name: Bagging
   Accuracy  Recall  Precision     F1  ROC-AUC
0     0.997   0.986      0.997  0.991    0.993

Model Name: Random forest
   Accuracy  Recall  Precision     F1  ROC-AUC
0     1.000   1.000      1.000  1.000    1.000

Model Name: Adaboost
   Accuracy  Recall  Precision     F1  ROC-AUC
0     0.959   0.838      0.897  0.867    0.910

Model Name: GBM
   Accuracy  Recall  Precision     F1  ROC-AUC
0     0.974   0.878      0.954  0.915    0.935

Model Name: XGB
   Accuracy  Recall  Precision     F1  ROC-AUC
0     1.000   1.000      1.000  1.000    1.000

Model Name: DT
   Accuracy  Recall  Precision     F1  ROC-AUC
0     1.000   1.000      1.000  1.000    1.000
for name, model in models:
    print("Model Name:", name)
    print(model_performance_classification_sklearn(model, X_val, y_val))
    print("")
    print("")
Model Name: Bagging
   Accuracy  Recall  Precision     F1  ROC-AUC
0     0.955   0.813      0.895  0.852    0.897

Model Name: Random forest
   Accuracy  Recall  Precision     F1  ROC-AUC
0     0.958   0.794      0.932  0.858    0.892

Model Name: Adaboost
   Accuracy  Recall  Precision     F1  ROC-AUC
0     0.960   0.859      0.889  0.874    0.919

Model Name: GBM
   Accuracy  Recall  Precision     F1  ROC-AUC
0     0.968   0.859      0.936  0.896    0.924

Model Name: XGB
   Accuracy  Recall  Precision     F1  ROC-AUC
0     0.966   0.865      0.919  0.891    0.925

Model Name: DT
   Accuracy  Recall  Precision     F1  ROC-AUC
0     0.938   0.810      0.807  0.809    0.886
for name, model in models:
    print("TRAIN MODEL NAME: ", name)
    confusion_matrix_sklearn(model, X_train, y_train)
    plt.show()
TRAIN MODEL NAME: Bagging
TRAIN MODEL NAME: Random forest
TRAIN MODEL NAME: Adaboost
TRAIN MODEL NAME: GBM
TRAIN MODEL NAME: XGB
TRAIN MODEL NAME: DT
for name, model in models:
    print("VAL MODEL NAME: ", name)
    confusion_matrix_sklearn(model, X_val, y_val)
    plt.show()
VAL MODEL NAME: Bagging
VAL MODEL NAME: Random forest
VAL MODEL NAME: Adaboost
VAL MODEL NAME: GBM
VAL MODEL NAME: XGB
VAL MODEL NAME: DT
# Synthetic Minority Oversampling Technique (SMOTE)
from imblearn.over_sampling import SMOTE  # not imported in the setup cell
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("Before OverSampling, count of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, count of label '0': {} \n".format(sum(y_train == 0)))
print("After OverSampling, count of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, count of label '0': {} \n".format(sum(y_train_over == 0)))
print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
# REDO MODEL FIT AND VAL for OVERSAMPLED DATA
over_models = []
over_models.append(("Bagging", BaggingClassifier(random_state=1)))
over_models.append(("Random forest", RandomForestClassifier(random_state=1)))
over_models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
over_models.append(("GBM", GradientBoostingClassifier(random_state=1)))
over_models.append(("XGB", XGBClassifier(random_state=1)))
over_models.append(("DT", DecisionTreeClassifier(random_state=1)))
print("\n" "Training Performance:" "\n")
for name, model in over_models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_train_over, model.predict(X_train_over))
    print("{}: {}".format(name, scores))
print("\n" "Validation Performance:" "\n")
for name, model in over_models:
    model.fit(X_train_over, y_train_over)
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))
Before OverSampling, count of label '1': 976
Before OverSampling, count of label '0': 5099

After OverSampling, count of label '1': 5099
After OverSampling, count of label '0': 5099

After OverSampling, the shape of train_X: (10198, 29)
After OverSampling, the shape of train_y: (10198,)

Training Performance:

Bagging: 0.9984310649146891
Random forest: 1.0
Adaboost: 0.9705824671504216
GBM: 0.9805844283192783
XGB: 1.0
DT: 1.0

Validation Performance:

Bagging: 0.843558282208589
Random forest: 0.8251533742331288
Adaboost: 0.8834355828220859
GBM: 0.8865030674846626
XGB: 0.8895705521472392
DT: 0.8374233128834356
# fix: evaluate the models trained on the oversampled data (was looping over the original models)
for name, model in over_models:
    print("Model Name:", name)
    print(model_performance_classification_sklearn(model, X_val, y_val))
    print("")
    print("")
Model Name: Bagging
   Accuracy  Recall  Precision     F1  ROC-AUC
0     0.955   0.813      0.895  0.852    0.897

Model Name: Random forest
   Accuracy  Recall  Precision     F1  ROC-AUC
0     0.958   0.794      0.932  0.858    0.892

Model Name: Adaboost
   Accuracy  Recall  Precision     F1  ROC-AUC
0     0.960   0.859      0.889  0.874    0.919

Model Name: GBM
   Accuracy  Recall  Precision     F1  ROC-AUC
0     0.968   0.859      0.936  0.896    0.924

Model Name: XGB
   Accuracy  Recall  Precision     F1  ROC-AUC
0     0.966   0.865      0.919  0.891    0.925

Model Name: DT
   Accuracy  Recall  Precision     F1  ROC-AUC
0     0.938   0.810      0.807  0.809    0.886
from imblearn.under_sampling import RandomUnderSampler  # not imported in the setup cell
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))
print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
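What `RandomUnderSampler` does can be sketched with plain pandas on a toy frame: sample the majority class down to the minority-class count (the column names here are illustrative, not from the bank dataset):

```python
# Pure-pandas sketch of random undersampling on a toy imbalanced frame.
import pandas as pd

df = pd.DataFrame({"x": range(10), "y": [0] * 8 + [1] * 2})  # 8 vs 2
n_min = df["y"].value_counts().min()                         # minority count = 2
balanced = df.groupby("y").sample(n=n_min, random_state=1)   # 2 rows per class
print(balanced["y"].value_counts().to_dict())                # {0: 2, 1: 2}
```

The trade-off is visible in the shapes printed above: the balanced training set shrinks to 1,952 rows, discarding most majority-class examples.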
# REDO MODEL FIT AND VAL for UNDERSAMPLED DATA
under_models = []
under_models.append(("Bagging", BaggingClassifier(random_state=1)))
under_models.append(("Random forest", RandomForestClassifier(random_state=1)))
under_models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
under_models.append(("GBM", GradientBoostingClassifier(random_state=1)))
under_models.append(("XGB", XGBClassifier(random_state=1)))
under_models.append(("DT", DecisionTreeClassifier(random_state=1)))
print("\n" "Training Performance:" "\n")
for name, model in under_models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_train_un, model.predict(X_train_un))
    print("{}: {}".format(name, scores))
print("\n" "Validation Performance:" "\n")
for name, model in under_models:
    model.fit(X_train_un, y_train_un)
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))
Before Under Sampling, counts of label 'Yes': 976
Before Under Sampling, counts of label 'No': 5099

After Under Sampling, counts of label 'Yes': 976
After Under Sampling, counts of label 'No': 976

After Under Sampling, the shape of train_X: (1952, 29)
After Under Sampling, the shape of train_y: (1952,)

Training Performance:

Bagging: 0.9918032786885246
Random forest: 1.0
Adaboost: 0.9528688524590164
GBM: 0.9795081967213115
XGB: 1.0
DT: 1.0

Validation Performance:

Bagging: 0.9355828220858896
Random forest: 0.9386503067484663
Adaboost: 0.9601226993865031
GBM: 0.9662576687116564
XGB: 0.9693251533742331
DT: 0.901840490797546
# fix: evaluate the models trained on the undersampled data (was looping over the original models)
for name, model in under_models:
    print("Model Name:", name)
    print(model_performance_classification_sklearn(model, X_val, y_val))
    print("")
    print("")
Model Name: Bagging
   Accuracy  Recall  Precision     F1  ROC-AUC
0     0.955   0.813      0.895  0.852    0.897

Model Name: Random forest
   Accuracy  Recall  Precision     F1  ROC-AUC
0     0.958   0.794      0.932  0.858    0.892

Model Name: Adaboost
   Accuracy  Recall  Precision     F1  ROC-AUC
0     0.960   0.859      0.889  0.874    0.919

Model Name: GBM
   Accuracy  Recall  Precision     F1  ROC-AUC
0     0.968   0.859      0.936  0.896    0.924

Model Name: XGB
   Accuracy  Recall  Precision     F1  ROC-AUC
0     0.966   0.865      0.919  0.891    0.925

Model Name: DT
   Accuracy  Recall  Precision     F1  ROC-AUC
0     0.938   0.810      0.807  0.809    0.886
1) Best validation performance on UNDERSAMPLED data: XGB trained on the undersampled data gives the best Recall on the validation set, and it stays close to its training-set Recall (i.e., very little overfitting). XGB Recall: 0.9693251533742331
2) Best validation performance on ORIGINAL data: XGB did best on training and, among the top training performers, came closest to its training performance on the validation set. XGB Recall: 0.8650306748466258
3) Best validation performance on OVERSAMPLED data: XGB did best on training and, among the top training performers, came closest to its training performance on the validation set. XGB Recall: 0.8895705521472392
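To make this comparison less dependent on a single train/validation split, each candidate could instead be scored with stratified k-fold cross-validated recall (a sketch on a scikit-learn toy dataset; in this notebook one would pass `X_train`, `y_train`):

```python
# Sketch: stratified 5-fold cross-validated recall for one candidate model.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=1), X, y, scoring="recall", cv=cv
)
print(scores.mean())  # mean recall across the 5 folds
```

Stratification keeps the class ratio the same in every fold, which matters here because only about 16% of customers are attrited.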
Note: the parameter grids below are samples for tuning the different model families (GBM, AdaBoost, Bagging, Random Forest, Decision Tree, XGBoost); only the XGBoost grid is used in the searches that follow.
param_grid = {
"init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"subsample":[0.7,0.9],
"max_features":[0.5,0.7,1],
}
param_grid = {
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"base_estimator": [
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
param_grid = {
'max_samples': [0.8,0.9,1],
'max_features': [0.7,0.8,0.9],
'n_estimators' : [30,50,70],
}
param_grid = {
"n_estimators": np.arange(50, 110, 25),  # fix: was a plain list [50, 110, 25]
"min_samples_leaf": np.arange(1, 4),
"max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],
"max_samples": np.arange(0.4, 0.7, 0.1)
}
param_grid = {
'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes' : [10, 15],
'min_impurity_decrease': [0.0001,0.001]
}
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.1,0.05],
'gamma':[1,3],
'subsample':[0.7,0.9]
}
# Define the parameter grid
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.1,0.05],
'gamma':[1,3],
'subsample':[0.7,0.9]
}
# Define the scorer
scorer = metrics.make_scorer(metrics.recall_score)
# Create an XGBoost classifier
xgb_orig = XGBClassifier(random_state=1)
# Create RandomizedSearchCV instance
randomized_cv_xgb_orig = RandomizedSearchCV(
estimator=xgb_orig,
param_distributions=param_grid,
n_iter=10,
scoring=scorer,
cv=5,
random_state=1,
n_jobs=-1
)
# Fit the RandomizedSearchCV instance on the original training data
randomized_cv_xgb_orig.fit(X_train, y_train)
# Print the best parameters and best score
print("Best parameters found:", randomized_cv_xgb_orig.best_params_)
print("Best recall score found:", randomized_cv_xgb_orig.best_score_)
Best parameters found: {'subsample': 0.7, 'scale_pos_weight': 5, 'n_estimators': 75, 'learning_rate': 0.05, 'gamma': 3}
Best recall score found: 0.9190371533228676
# Build the model with the best parameters found, random_state=1
xgb_orig_tuned = XGBClassifier(
    random_state=1, subsample=0.7, scale_pos_weight=5, n_estimators=75, learning_rate=0.05, gamma=3
)
# Fit the model on training data
xgb_orig_tuned.fit(X_train, y_train)
# Call model_performance_classification_sklearn using above model on TRAIN AND VAL
# Calculating different metrics on train set
xgb_orig_grid_train = model_performance_classification_sklearn(
xgb_orig_tuned, X_train, y_train
)
xgb_orig_grid_val = model_performance_classification_sklearn(
xgb_orig_tuned, X_val, y_val
)
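As an aside, `RandomizedSearchCV` refits the best configuration on the full training data by default (`refit=True`), so its `best_estimator_` attribute can be used directly instead of re-instantiating the model with the best parameters. A small self-contained sketch with a decision tree on a toy dataset:

```python
# Sketch: reuse the already-refitted best model from a randomized search.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_distributions={"max_depth": [2, 3, 4]},
    n_iter=3,
    scoring="recall",
    cv=3,
    random_state=1,
)
search.fit(X, y)
best_model = search.best_estimator_  # already fitted; ready for evaluation
print(best_model.get_params()["max_depth"])
```

Using `best_estimator_` avoids the risk of a typo when copying `best_params_` into a fresh constructor call.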
# Define the parameter grid
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.1,0.05],
'gamma':[1,3],
'subsample':[0.7,0.9]
}
# Define the scorer
scorer = metrics.make_scorer(metrics.recall_score)
# Create an XGBoost classifier
xgb_over = XGBClassifier(random_state=1)
# Create RandomizedSearchCV instance
randomized_cv_xgb_over = RandomizedSearchCV(
estimator=xgb_over,
param_distributions=param_grid,
n_iter=10,
scoring=scorer,
cv=5,
random_state=1,
n_jobs=-1
)
# Fit the RandomizedSearchCV instance on the oversampled data
randomized_cv_xgb_over.fit(X_train_over, y_train_over)
# Print the best parameters and best score
print("Best parameters found:", randomized_cv_xgb_over.best_params_)
print("Best recall score found:", randomized_cv_xgb_over.best_score_)
Best parameters found: {'subsample': 0.7, 'scale_pos_weight': 5, 'n_estimators': 75, 'learning_rate': 0.05, 'gamma': 3}
Best recall score found: 0.9798039215686274
# Build the model with the best parameters found, random_state=1
xgb_over_tuned = XGBClassifier(
    random_state=1, subsample=0.7, scale_pos_weight=5, n_estimators=75, learning_rate=0.05, gamma=3
)
# Fit the model on training data
xgb_over_tuned.fit(X_train_over, y_train_over)
# Call model_performance_classification_sklearn using above model on TRAIN AND VAL
# Calculating different metrics on train set
xgb_over_grid_train = model_performance_classification_sklearn(
xgb_over_tuned, X_train_over, y_train_over
)
xgb_over_grid_val = model_performance_classification_sklearn(
xgb_over_tuned, X_val, y_val
)
# Define the parameter grid
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.1,0.05],
'gamma':[1,3],
'subsample':[0.7,0.9]
}
# Define the scorer
scorer = metrics.make_scorer(metrics.recall_score)
# Create an XGBoost classifier
xgb_under = XGBClassifier(random_state=1)
# Create RandomizedSearchCV instance
randomized_cv_xgb_under = RandomizedSearchCV(
estimator=xgb_under,
param_distributions=param_grid,
n_iter=10,
scoring=scorer,
cv=5,
random_state=1,
n_jobs=-1
)
# Fit the RandomizedSearchCV instance on the undersampled data
randomized_cv_xgb_under.fit(X_train_un, y_train_un)
# Print the best parameters and best score
print("Best parameters found:", randomized_cv_xgb_under.best_params_)
print("Best recall score found:", randomized_cv_xgb_under.best_score_)
Best parameters found: {'subsample': 0.7, 'scale_pos_weight': 5, 'n_estimators': 75, 'learning_rate': 0.05, 'gamma': 3}
Best recall score found: 0.9774882260596547
# Build the model with the best parameters found, random_state=1
xgb_under_tuned = XGBClassifier(
    random_state=1, subsample=0.7, scale_pos_weight=5, n_estimators=75, learning_rate=0.05, gamma=3
)
# Fit the model on training data
xgb_under_tuned.fit(X_train_un, y_train_un)
# Call model_performance_classification_sklearn using above model on TRAIN AND VAL
# Calculating different metrics on train set
xgb_under_grid_train = model_performance_classification_sklearn(
xgb_under_tuned, X_train_un, y_train_un
)
xgb_under_grid_val = model_performance_classification_sklearn(
xgb_under_tuned, X_val, y_val
)
# training performance comparison
models_train_comp_df = pd.concat(
[
xgb_orig_grid_train.T,
xgb_over_grid_train.T,
xgb_under_grid_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"XGB_hypertuned_RandomSearchCV_ORIG_Data",
"XGB_hypertuned_RandomSearchCV_Oversampled_Data",
"XGB_hypertuned_RandomSearchCV_Undersampled_Data",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | XGB_hypertuned_RandomSearchCV_ORIG_Data | XGB_hypertuned_RandomSearchCV_Oversampled_Data | XGB_hypertuned_RandomSearchCV_Undersampled_Data |
|---|---|---|---|
| Accuracy | 0.974 | 0.974 | 0.960 |
| Recall | 0.995 | 1.000 | 1.000 |
| Precision | 0.865 | 0.951 | 0.925 |
| F1 | 0.925 | 0.974 | 0.961 |
| ROC-AUC | 0.983 | 0.974 | 0.960 |
# VAL performance comparison
models_val_comp_df = pd.concat(
[
xgb_orig_grid_val.T,
xgb_over_grid_val.T,
xgb_under_grid_val.T,
],
axis=1,
)
models_val_comp_df.columns = [
"XGB_hypertuned_RandomSearchCV_ORIG_Data",
"XGB_hypertuned_RandomSearchCV_Oversampled_Data",
"XGB_hypertuned_RandomSearchCV_Undersampled_Data",
]
print("VALIDATION performance comparison:")
models_val_comp_df
VALIDATION performance comparison:
| | XGB_hypertuned_RandomSearchCV_ORIG_Data | XGB_hypertuned_RandomSearchCV_Oversampled_Data | XGB_hypertuned_RandomSearchCV_Undersampled_Data |
|---|---|---|---|
| Accuracy | 0.956 | 0.936 | 0.896 |
| Recall | 0.939 | 0.948 | 0.988 |
| Precision | 0.816 | 0.734 | 0.610 |
| F1 | 0.873 | 0.827 | 0.754 |
| ROC-AUC | 0.949 | 0.941 | 0.933 |
XGBoost trained on the oversampled data and then tuned with the parameters obtained from RandomizedSearchCV.
Reason: strong Recall with the least overfitting - its validation-set performance is closest to its training-set performance. F1 and ROC-AUC are also good, which makes it the best model for this use case.
Additionally, its other metrics are also high, meaning it should perform well on other aspects too, such as false positives.
Model_test = model_performance_classification_sklearn(xgb_over_tuned, X_test, y_test)
Model_test
| | Accuracy | Recall | Precision | F1 | ROC-AUC |
|---|---|---|---|---|---|
| 0 | 0.939 | 0.972 | 0.733 | 0.836 | 0.952 |
feature_importances = xgb_over_tuned.feature_importances_
for i, importance in enumerate(feature_importances):
print(f"Feature {feature_names[i]}: Importance = {importance}")
Feature Customer_Age: Importance = 0.01980598457157612
Feature Dependent_count: Importance = 0.01788775995373726
Feature Months_on_book: Importance = 0.01357334479689598
Feature Total_Relationship_Count: Importance = 0.04883166775107384
Feature Months_Inactive_12_mon: Importance = 0.04932628571987152
Feature Contacts_Count_12_mon: Importance = 0.03035697154700756
Feature Credit_Limit: Importance = 0.01896979659795761
Feature Total_Revolving_Bal: Importance = 0.053592126816511154
Feature Avg_Open_To_Buy: Importance = 0.029062069952487946
Feature Total_Amt_Chng_Q4_Q1: Importance = 0.03274666517972946
Feature Total_Trans_Amt: Importance = 0.07748861610889435
Feature Total_Trans_Ct: Importance = 0.23960600793361664
Feature Total_Ct_Chng_Q4_Q1: Importance = 0.0389585867524147
Feature Avg_Utilization_Ratio: Importance = 0.03209696337580681
Feature Gender_M: Importance = 0.0447402261197567
Feature Education_Level_Doctorate: Importance = 0.005759861320257187
Feature Education_Level_Graduate: Importance = 0.016655225306749344
Feature Education_Level_High School: Importance = 0.020260360091924667
Feature Education_Level_Post-Graduate: Importance = 0.012197164818644524
Feature Education_Level_Uneducated: Importance = 0.014489570632576942
Feature Marital_Status_Married: Importance = 0.05926361680030823
Feature Marital_Status_Single: Importance = 0.02079702913761139
Feature Income_Category_$40K - $60K: Importance = 0.011936217546463013
Feature Income_Category_$60K - $80K: Importance = 0.0099178496748209
Feature Income_Category_$80K - $120K: Importance = 0.02144428715109825
Feature Income_Category_Less than $40K: Importance = 0.019648198038339615
Feature Card_Category_Gold: Importance = 0.005476157646626234
Feature Card_Category_Platinum: Importance = 0.031135471537709236
Feature Card_Category_Silver: Importance = 0.0039758929051458836
indices = np.argsort(feature_importances)
plt.figure(figsize=(12, 12))  # fix: was commented out as indices.figure(...)
plt.title("Feature Importances")
plt.barh(range(len(indices)), feature_importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
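A tidier alternative to indexing two parallel lists is to pair importances with feature names in a pandas Series and sort it (a sketch with toy values, not the fitted model's actual importances):

```python
# Sketch: name-indexed, sorted feature importances (toy values).
import pandas as pd

toy_names = ["Total_Trans_Ct", "Total_Trans_Amt", "Customer_Age"]
toy_importances = [0.2396, 0.0775, 0.0198]

imp = pd.Series(toy_importances, index=toy_names).sort_values(ascending=False)
print(imp.index[0])  # the top feature
```

In the notebook this would be `pd.Series(xgb_over_tuned.feature_importances_, index=feature_names)`, and the sorted Series can be plotted directly with `imp.plot.barh()`.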
Target customers based on the 3 most important features: Total_Trans_Ct, Total_Trans_Amt, Marital_Status_Married.
Target customers that fall in the below categories:
From the EDA (bivariate analyses), the following groups of customers are more likely to leave: